{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## TF-IDF\n",
    "\n",
    "调用：\n",
    "        在下面的代码段中，我们以一组句子开始。首先使用分解器Tokenizer把句子划分为单个词语。对每一个句子（词袋），我们使用HashingTF将句子转换为特征向量，最后使用IDF重新调整特征向量。这种转换通常可以提高使用文本特征的性能。然后，我们的特征向量可以在算法学习中[plain]view plaincopy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import HashingTF, IDF, Tokenizer  \n",
    "  \n",
    "sentenceData = spark.createDataFrame([  \n",
    "    (0, \"Hi I heard about Spark\"),  \n",
    "    (0, \"I wish Java could use case classes\"),  \n",
    "    (1, \"Logistic regression models are neat\")  \n",
    "], [\"label\", \"sentence\"])  \n",
    "tokenizer = Tokenizer(inputCol=\"sentence\", outputCol=\"words\")  \n",
    "wordsData = tokenizer.transform(sentenceData)  \n",
    "hashingTF = HashingTF(inputCol=\"words\", outputCol=\"rawFeatures\", numFeatures=20)  \n",
    "featurizedData = hashingTF.transform(wordsData)  \n",
    "# CountVectorizer也可获取词频向量  \n",
    "  \n",
    "idf = IDF(inputCol=\"rawFeatures\", outputCol=\"features\")  \n",
    "idfModel = idf.fit(featurizedData)  \n",
    "rescaledData = idfModel.transform(featurizedData)  \n",
    "for features_label in rescaledData.select(\"features\", \"label\").take(3):  \n",
    "    print(features_label)  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Word2Vec\n",
    "\n",
    "        Word2vec是一个Estimator，它采用一系列代表文档的词语来训练word2vecmodel。该模型将每个词语映射到一个固定大小的向量。word2vecmodel使用文档中每个词语的平均数来将文档转换为向量，然后这个向量可以作为预测的特征，来计算文档相似度计算等等。\n",
    "\n",
    "        在下面的代码段中，我们首先用一组文档，其中每一个文档代表一个词语序列。对于每一个文档，我们将其转换为一个特征向量。此特征向量可以被传递到一个学习算法。\n",
    "调用："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import Word2Vec\n",
    " \n",
    "# Input data: Each row is a bag of words from a sentence or document.\n",
    "documentDF = spark.createDataFrame([\n",
    "    (\"Hi I heard about Spark\".split(\" \"), ),\n",
    "    (\"I wish Java could use case classes\".split(\" \"), ),\n",
    "    (\"Logistic regression models are neat\".split(\" \"), )\n",
    "], [\"text\"])\n",
    "# Learn a mapping from words to Vectors.\n",
    "word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol=\"text\", outputCol=\"result\")\n",
    "model = word2Vec.fit(documentDF)\n",
    "result = model.transform(documentDF)\n",
    "for feature in result.select(\"result\").take(3):\n",
    "    print(feature)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Countvectorizer\n",
    "\n",
    "     Countvectorizer和Countvectorizermodel旨在通过计数来将一个文档转换为向量。当不存在先验字典时，Countvectorizer可作为Estimator来提取词汇，并生成一个Countvectorizermodel。该模型产生文档关于词语的稀疏表示，其表示可以传递给其他算法如LDA。\n",
    "\n",
    "     在fitting过程中，countvectorizer将根据语料库中的词频排序选出前vocabsize个词。一个可选的参数minDF也影响fitting过程中，它指定词汇表中的词语在文档中最少出现的次数。另一个可选的二值参数控制输出向量，如果设置为真那么所有非零的计数为1。这对于二值型离散概率模型非常有用。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import CountVectorizer\n",
    " \n",
    "# Input data: Each row is a bag of words with a ID.\n",
    "df = spark.createDataFrame([\n",
    "    (0, \"a b c\".split(\" \")),\n",
    "    (1, \"a b b c a\".split(\" \"))\n",
    "], [\"id\", \"words\"])\n",
    " \n",
    "# fit a CountVectorizerModel from the corpus.\n",
    "cv = CountVectorizer(inputCol=\"words\", outputCol=\"features\", vocabSize=3, minDF=2.0)\n",
    "model = cv.fit(df)\n",
    "result = model.transform(df)\n",
    "result.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
